Open in Colab

AIMon's LlamaIndex Extension for LLM Response Evaluation

This notebook introduces AIMon's evaluators for the LlamaIndex framework, which are designed to assess the quality and accuracy of responses generated by language models (LLMs) integrated into LlamaIndex. Below is an overview of all available evaluators:

  • Hallucination Evaluator: Detects when a model generates information not supported by the provided context (hallucinations).
  • Guideline Evaluator: Ensures model responses follow predefined instructions and guidelines.
  • Completeness Evaluator: Checks whether the response fully addresses all aspects of the query or task.
  • Conciseness Evaluator: Evaluates if the response is brief yet complete, avoiding unnecessary verbosity.
  • Toxicity Evaluator: Flags harmful, offensive, or inappropriate language in the response.
  • Context Relevance Evaluator: Assesses the relevance and accuracy of the provided context in supporting the model's response.

In this notebook, we will focus on utilizing the Hallucination Evaluator, Guideline Evaluator, and Context Relevance Evaluator to improve your RAG (Retrieval-Augmented Generation) applications.

To learn more about AIMon, check out these resources: Website and Documentation

Prerequisites

Let's get started by installing the dependencies and setting up the API keys.

%%capture
!pip install requests datasets aimon-llamaindex llama-index-embeddings-openai llama-index-llms-openai

Configure your OPENAI_API_KEY and AIMON_API_KEY in Google Colab secrets and grant the notebook access to them. We will use OpenAI for the LLM and embedding models, and AIMon for continuous monitoring of quality issues.

An AIMon API key can be obtained here.

import os
import json
# Import Colab Secrets userdata module.
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
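
If you are running this notebook outside Colab, the same keys can be supplied as environment variables instead of Colab secrets. The snippet below is a minimal sketch using the standard getpass module; note that the AIMon client cell further down also reads AIMON_API_KEY from Colab secrets, so you would adjust that line accordingly.

import os
from getpass import getpass

## Outside Colab (sketch): prompt for the keys instead of reading Colab secrets.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
if "AIMON_API_KEY" not in os.environ:
    os.environ["AIMON_API_KEY"] = getpass("Enter your AIMon API key: ")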

Dataset for evaluation

In this example, we will use the transcripts from the MeetingBank dataset as our contextual information.

%%capture
from datasets import load_dataset
meetingbank = load_dataset("huuuyeah/meetingbank")
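
If you want to sanity-check the download, a quick inspection like the sketch below (using the standard Hugging Face datasets API) shows the available splits and a preview of one transcript.

## Inspect the dataset splits and preview a single transcript.
print(meetingbank)
sample = meetingbank['train'][0]
print(sample.keys())
print(sample['transcript'][:300])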

This function converts the transcripts into a list of llama_index.core.Document objects.

from llama_index.core import Document

def extract_and_create_documents(transcripts):
    documents = []
    for transcript in transcripts:
        try:
            doc = Document(text=transcript)
            documents.append(doc)
        except Exception as e:
            print(f"Failed to create document: {e}")
    return documents

transcripts = [meeting['transcript'] for meeting in meetingbank['train']]
documents = extract_and_create_documents(transcripts[:5]) ## Using only 5 transcripts to keep this example fast and concise.

Set up an embedding model. We will be using the text-embedding-3-small model here.

from llama_index.embeddings.openai import OpenAIEmbedding
embedding_model = OpenAIEmbedding(model="text-embedding-3-small", embed_batch_size=100, max_retries=3)

Split documents into nodes and generate their embeddings

from aimon_llamaindex import generate_embeddings_for_docs
nodes = generate_embeddings_for_docs(documents, embedding_model)
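
Under the hood, this helper chunks the documents into nodes and attaches an embedding to each node. The code below is a rough, plain-LlamaIndex sketch of that idea; it is only an illustration with assumed chunking parameters, not the helper's actual implementation.

## Illustrative sketch only: approximates what generate_embeddings_for_docs does
## (the helper's actual chunking settings may differ).
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)  # assumed parameters
nodes_sketch = splitter.get_nodes_from_documents(documents)
for node in nodes_sketch:
    node.embedding = embedding_model.get_text_embedding(node.get_content())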

Insert the nodes with embeddings into an in-memory Vector Store Index.

from aimon_llamaindex import build_index

index = build_index(nodes)

Instantiate a Vector Index Retriever.

from aimon_llamaindex import build_retriever

retriever = build_retriever(index, similarity_top_k=5)
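
For reference, these two helpers roughly correspond to LlamaIndex's own VectorStoreIndex and retriever APIs. The sketch below shows an approximate plain-LlamaIndex equivalent; the helpers may configure additional defaults.

## Illustrative sketch: approximate plain-LlamaIndex equivalent of build_index and build_retriever.
from llama_index.core import VectorStoreIndex

index_sketch = VectorStoreIndex(nodes=nodes, embed_model=embedding_model)
retriever_sketch = index_sketch.as_retriever(similarity_top_k=5)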

Building the LLM Application

Configure the Large Language Model. Here we choose OpenAI's gpt-4o-mini model with a temperature setting of 0.4.

## OpenAI's LLM
from llama_index.llms.openai import OpenAI
llm = OpenAI(
    model="gpt-4o-mini",
    temperature=0.4,
    system_prompt="""
        Please be professional and polite.
        Answer the user's question in a single line.
        Even if the context lacks information to answer the question, make
        sure that you answer the user's question based on your own knowledge.
    """
)

Define your query and instructions

user_query = "Which council bills were amended for zoning regulations?"
user_instructions = "1. Keep the response concise, preferably under the 100 word limit."

Append the dynamically defined user instructions to the LLM's system prompt.

llm.system_prompt += f"Please comply to the following instructions {user_instructions}."

Retrieve a response for the query.

from aimon_llamaindex import get_response
llm_response = get_response(user_query, retriever, llm)
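
Conceptually, this helper retrieves the top-k nodes for the query, packs their text into a prompt as context, and asks the LLM to answer. The sketch below illustrates that retrieve-then-generate pattern; it is an assumption about the helper's behavior, not its exact implementation.

## Simplified sketch of a retrieve-then-generate step (get_response itself may build the prompt differently).
def get_response_sketch(query, retriever, llm):
    retrieved_nodes = retriever.retrieve(query)
    context = "\n\n".join(node.get_content() for node in retrieved_nodes)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm.complete(prompt)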

Running Evaluations using AIMon

Configure AIMon Client

from aimon import Client
aimon_client = Client(auth_header="Bearer {}".format(userdata.get('AIMON_API_KEY')))

Using AIMon’s Instruction Adherence Model (a.k.a. Guideline Evaluator)

This model evaluates if generated text adheres to given instructions, ensuring that LLMs follow the user’s guidelines and intent across various tasks for more accurate and relevant outputs.

from aimon_llamaindex.evaluators import GuidelineEvaluator

guideline_evaluator = GuidelineEvaluator(aimon_client)
evaluation_result = guideline_evaluator.evaluate(user_query, user_instructions, llm_response)
## Printing the initial guideline adherence result
print(json.dumps(evaluation_result, indent=4))
    {
        "results": [
            {
                "adherence": true,
                "detailed_explanation": "The response is concise, consisting of only 21 words. It directly answers the user query regarding council bills amended for zoning regulations without including extraneous information.",
                "instruction": "Keep the response concise, preferably under the 100 word limit."
            }
        ],
        "score": 1.0
    }
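
The results list contains one entry per instruction, so when several instructions are passed, each can be checked individually. The small helper below is hypothetical and simply walks the result shape shown above.

## Hypothetical helper: report adherence per instruction, based on the result shape above.
def report_adherence(evaluation_result):
    for item in evaluation_result['results']:
        status = "PASS" if item['adherence'] else "FAIL"
        print(f"[{status}] {item['instruction']}")
    print(f"Overall score: {evaluation_result['score']}")

report_adherence(evaluation_result)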
## Running a loop to improve guideline adherence
current_attempt = 1
max_attempts = 2
reattempted_g = False

while not evaluation_result['results'][0]['adherence']:
    if current_attempt > max_attempts:
        break

    reattempted_g = True
    llm.system_prompt = f"""The last LLM response failed to comply with the instructions. \
Please adhere to the following instructions while generating the next response: {user_instructions}"""
    llm_response = get_response(user_query, retriever, llm)
    evaluation_result = guideline_evaluator.evaluate(user_query, user_instructions, llm_response)
    current_attempt += 1

## Printing the final guideline adherence result, if the LLM was prompted again
if reattempted_g:
    print(json.dumps(evaluation_result, indent=4))
else:
    print("Instructions were complied with in the first prompt to the LLM.")

Instructions were complied with in the first prompt to the LLM.

Using AIMon’s Hallucination Detection Evaluator Model (HDM-1) to improve the quality of responses generated by the LLM application.

AIMon’s HDM-1 detects hallucinated content in LLM outputs. It provides a “hallucination score” (0.0–1.0) quantifying the likelihood of factual inaccuracies or fabricated information, ensuring more reliable and accurate responses.

from aimon_llamaindex.evaluators import HallucinationEvaluator

hallucination_evaluator = HallucinationEvaluator(aimon_client)
evaluation_result = hallucination_evaluator.evaluate(user_query, user_instructions, llm_response)
## Printing the initial evaluation result for Hallucination
print(json.dumps(evaluation_result, indent=4))
    {
        "is_hallucinated": "False",
        "score": 0.17421,
        "sentences": [
            {
                "score": 0.17421,
                "text": "The council bills amended for zoning regulations include the small lot moratorium and the text amendment language related to development."
            }
        ]
    }
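
Besides the overall score, the evaluator returns a per-sentence breakdown, which helps pinpoint the specific claims that look unsupported. The snippet below sorts the sentences by score, based on the response shape shown above.

## Inspect per-sentence hallucination scores, highest first.
for sentence in sorted(evaluation_result['sentences'], key=lambda s: s['score'], reverse=True):
    print(f"{sentence['score']:.3f}  {sentence['text']}")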
## Let hallucination threshold be at 0.65
hallucination_threshold = 0.65

## Let maximum attempts to reduce hallucination be 2
max_attempts = 2
current_attempt = 1
reattempted_h = False

while evaluation_result['score'] > hallucination_threshold:

    if current_attempt > max_attempts:
        break

    reattempted_h = True
    llm.system_prompt = (
        f"The latest LLM response obtained was {llm_response}. "
        f"Upon evaluation, it was found that this LLM response hallucinated with a score of {evaluation_result['score']}. "
        "Please generate a new response, and take into consideration the following factors: "
        "a. If the hallucination score is between 0 and 0.5, no special action is required for improving accuracy. "
        "b. If the hallucination score is between 0.5 and 0.75, please focus on reducing any noticeable inaccuracies, while ensuring the response is as reliable as possible. "
        "c. If the hallucination score is between 0.75 and 1, take extra care to minimize significant inaccuracies and ensure that the response is mostly factual and reliable, avoiding any fictional or misleading content. "
        f"Also, make sure to comply with the following instructions while generating new responses: {user_instructions}"
    )
    llm_response = get_response(user_query, retriever, llm)
    evaluation_result = hallucination_evaluator.evaluate(user_query, user_instructions, llm_response)
    current_attempt += 1

## Print the final evaluation result for Hallucination, if the LLM was prompted again
if reattempted_h:
    print(json.dumps(evaluation_result, indent=4))
else:
    print("The LLM response received in the first run was not hallucinated. Therefore, the LLM was not prompted again to reduce hallucination.")

The LLM response received in the first run was not hallucinated. Therefore, the LLM was not prompted again to reduce hallucination.

Using AIMon's Context Relevance Evaluator to evaluate the relevance of context data used by the LLM to generate the response.

## Printing the source documents used to generate this response
print(llm_response.get_formatted_sources())

Source (Doc id: b67d14bd-00f4-43c7-a6f3-48abeae673e1): Excuse me. Over the past 12 months, the forum has several times voted and sent letters of support...

Source (Doc id: 45249fbb-1b91-4fd3-8a61-de1f76fc8987): It's still expensive, still absolutely expensive to live in Denver, no matter where it is. Everyb...

Source (Doc id: e65ec5cf-4420-4da5-8157-81236770e474): I wondered why was it implemented on East Colfax, but never used anywhere else or never used on E...

Source (Doc id: 79a3e0db-d785-4342-bf2d-fa9bf57adb9b): So here are if you really want to go after affordable housing. Give somebody a living wage so tha...

Source (Doc id: f3396525-3c26-4864-a2e6-a557cf5b311e): It has never happened. Responding to the request, all of the neighborhoods. We did do a parking p...

from aimon_llamaindex.evaluators import ContextRelevanceEvaluator

evaluator = ContextRelevanceEvaluator(aimon_client)
task_definition = "Find the relevance of the context data used to generate this response."
evaluation_result = evaluator.evaluate(user_query, user_instructions, llm_response, task_definition)
print(json.dumps(evaluation_result, indent=4))
    [
        {
            "explanations": [
                "Document 1 discusses a text amendment related to zoning and parking issues, indicating that there are ongoing discussions about zoning regulations in the context of development. However, it does not specifically mention which council bills were amended for zoning regulations, making it less relevant to the query.",
                "2. Document 2 touches on the topic of zoning and development but focuses more on affordability and transit issues rather than explicitly detailing any council bills or amendments related to zoning regulations. This lack of direct reference to specific bills means it does not adequately address the query.",
                "3. Document 3 mentions a zoning exemption and discusses parking regulations but does not specify which council bills were amended for zoning regulations. While it acknowledges the need for balance in zoning laws, it fails to provide concrete examples of amendments, rendering it less relevant to the query.",
                "4. Document 4 highlights the need for a multi-modal city and critiques existing plans but does not mention any specific council bills or amendments related to zoning regulations. Although it discusses urban planning, it does not provide the necessary details to answer the query directly.",
                "5. Document 5 references a compromise ordinance amending the zoning code to provide off-street parking exemptions, which is relevant to zoning regulations. However, it does not specify the particular council bills that were amended, which is a key aspect of the query, thus limiting its relevance."
            ],
            "query": "Which council bills were amended for zoning regulations?",
            "relevance_scores": [
                0.40797651372763255,
                0.4011011577796637,
                0.4522964940034875,
                0.39406661077237004,
                0.43046392545075207
            ]
        }
    ]
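
One way to act on these scores is to flag retrieved chunks that fall below a chosen relevance threshold, for example to tighten the retriever or drop weak context before regenerating a response. The snippet below is an illustrative sketch; the 0.5 threshold is an arbitrary value for this example.

## Illustrative: flag low-relevance context chunks (0.5 is an arbitrary example threshold).
relevance_threshold = 0.5
scores = evaluation_result[0]['relevance_scores']
low_relevance = [i for i, score in enumerate(scores) if score < relevance_threshold]
print(f"{len(low_relevance)} of {len(scores)} retrieved chunks fall below {relevance_threshold}: {low_relevance}")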

Conclusion

In this notebook, we built a simple RAG application using the LlamaIndex framework. After retrieving a response to a query, we assessed it with AIMon’s evaluators and used the evaluation results to refine the model’s performance.